SQuId: Measuring Speech Naturalness in Many Languages
Much of text-to-speech research relies on human evaluation, which incurs
heavy costs and slows down the development process. The problem is particularly
acute in heavily multilingual applications, where recruiting and polling judges
can take weeks. We introduce SQuId (Speech Quality Identification), a
multilingual naturalness prediction model trained on over a million ratings and
tested in 65 locales, the largest effort of this type to date. The main insight
is that training one model on many locales consistently outperforms mono-locale
baselines. We present our task, the model, and show that it outperforms a
competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then
demonstrate the effectiveness of cross-locale transfer during fine-tuning and
highlight its effect on zero-shot locales, i.e., locales for which there is no
fine-tuning data. Through a series of analyses, we highlight the role of
non-linguistic effects such as sound artifacts in cross-locale transfer.
Finally, we present the effect of our design decisions, e.g., model size,
pre-training diversity, and language rebalancing, with several ablation
experiments.
Comment: Accepted at ICASSP 2023, with additional material in the appendix
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
The recently proposed massively multilingual neural machine translation (NMT)
system has been shown to be capable of translating over 100 languages to and
from English within a single model. Its improved translation performance on low
resource languages hints at potential cross-lingual transfer capability for
downstream tasks. In this paper, we evaluate the cross-lingual effectiveness of
representations from the encoder of a massively multilingual NMT model on 5
downstream classification and sequence labeling tasks covering a diverse set of
over 50 languages. We compare against a strong baseline, multilingual BERT
(mBERT), in different cross-lingual transfer learning scenarios and show gains
in zero-shot transfer in 4 out of these 5 tasks.
Multimodal Modeling For Spoken Language Identification
Spoken language identification refers to the task of automatically predicting
the spoken language in a given utterance. Conventionally, it is modeled as a
speech-based language identification task. Prior techniques have been
constrained to a single modality; however, in the case of video data there is a
wealth of other metadata that may be beneficial for this task. In this work, we
propose MuSeLI, a Multimodal Spoken Language Identification method, which
delves into the use of various metadata sources to enhance language
identification. Our study reveals that metadata such as video title,
description and geographic location provide substantial information to identify
the spoken language of the multimedia recording. We conduct experiments using
two diverse public datasets of YouTube videos, and obtain state-of-the-art
results on the language identification task. We additionally conduct an
ablation study that describes the distinct contribution of each modality for
language recognition.
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
We introduce the Universal Speech Model (USM), a single large model that
performs automatic speech recognition (ASR) across 100+ languages. This is
achieved by pre-training the encoder of the model on a large unlabeled
multilingual dataset of 12 million (M) hours spanning over 300 languages, and
fine-tuning on a smaller labeled dataset. We use multilingual pre-training with
random-projection quantization and speech-text modality matching to achieve
state-of-the-art performance on downstream multilingual ASR and speech-to-text
translation tasks. We also demonstrate that despite using a labeled training
set one-seventh the size of that used for the Whisper model, our model exhibits
comparable or better performance on both in-domain and out-of-domain speech
recognition tasks across many languages.
Comment: 20 pages, 7 figures, 8 tables
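The random-projection quantization mentioned above turns continuous speech frames into discrete pre-training targets by projecting them through a frozen random matrix and snapping to the nearest entry of a frozen random codebook (in the style of BEST-RQ). A minimal sketch; all dimensions and the use of NumPy here are illustrative choices of ours, not USM's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, codebook_size, code_dim = 80, 16, 8  # toy sizes, not USM's

# Both matrices are sampled once and never trained.
proj = rng.normal(size=(feat_dim, code_dim))        # frozen random projection
codebook = rng.normal(size=(codebook_size, code_dim))  # frozen random codebook

def quantize(frames):
    """Map each speech frame (row of `frames`) to the index of the
    nearest codebook vector after the fixed random projection,
    yielding discrete targets for self-supervised pre-training."""
    z = frames @ proj                                        # (T, code_dim)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return dists.argmin(axis=1)                              # (T,) int targets
```

Because the quantizer is random and frozen, the encoder (not the targets) must do all the representational work, which keeps the pre-training objective cheap and stable.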
Building an English-Iraqi Arabic Machine Translation System for Spoken Utterances with Limited Resources
This paper presents an English-Iraqi Arabic speech-to-speech statistical machine translation system built with limited resources. We explore the constraints involved, describe how we endeavored to mitigate problems such as a non-standard orthography and a highly inflected grammar, and discuss leveraging the plentiful existing resources for Modern Standard Arabic to assist in this task. These combined techniques reduce unknown words at translation time by over 40% and yield a +3.65 increase in BLEU score over a previous state-of-the-art system using the same parallel training corpus of spoken utterances.
Index Terms: speech translation, limited resources, Arabic
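BLEU, the metric cited above, is the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. A minimal single-reference, sentence-level sketch; the smoothing constant is our own illustrative choice, not part of the paper:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty for short hypotheses."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Counter intersection clips each n-gram's count by the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        # Smooth with a tiny floor so a zero precision doesn't zero the score.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Corpus-level BLEU, as reported in papers, pools n-gram counts over all sentences before computing the precisions rather than averaging sentence scores.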
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
We introduce FLEURS, the Few-shot Learning Evaluation of Universal
Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset
in 102 languages built on top of the machine translation FLoRes-101 benchmark,
with approximately 12 hours of speech supervision per language. FLEURS can be
used for a variety of speech tasks, including Automatic Speech Recognition
(ASR), Speech Language Identification (Speech LangID), Translation and
Retrieval. In this paper, we provide baselines for the tasks based on
multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable
speech technology in more languages and catalyze research in low-resource
speech understanding.